feat(cache): multi-slot LRU MRU partial block cache #1149
Server-side snapshot differencing via CacheRateTracker: stores the last
90 snapshots of cumulative cache counters (10s intervals, 15 min window)
and computes rates via start/end differencing. Zero hot-path changes --
snapshots are lazy, driven by dashboard polling cadence.

New metrics exposed in /api/stats under cache_observability:
- prefix_hit_rate (cumulative + windowed)
- eviction count, ssd_hot_rate
- per-model and weighted aggregate across models

Dashboard: new "Cache Breakdown" card below Average Speed showing hit
rate, evictions, and hot cache hits. Session-only (hidden in All-Time
view since counters reset on model reload).
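The start/end differencing described above can be sketched roughly as follows. This is a minimal illustration, not the actual CacheRateTracker: the class name `RateTracker`, the method names, and the counter keys `hits` and `lookups` are assumptions made for the example.

```python
from collections import deque


class RateTracker:
    """Hypothetical stand-in for CacheRateTracker (all names illustrative)."""

    def __init__(self, max_snapshots=90):
        # A bounded deque drops the oldest snapshot once the window is full.
        self._snaps = deque(maxlen=max_snapshots)

    def record(self, timestamp, counters):
        # Intended to be called lazily (e.g. on a dashboard poll),
        # never from the request hot path.
        self._snaps.append((timestamp, dict(counters)))

    def windowed_hit_rate(self):
        # A rate needs two snapshots: the oldest and the newest in the window.
        if len(self._snaps) < 2:
            return None
        (_, start), (_, end) = self._snaps[0], self._snaps[-1]
        hits = end["hits"] - start["hits"]
        lookups = end["lookups"] - start["lookups"]
        return hits / lookups if lookups > 0 else None
```

Because the counters are cumulative, only the first and last snapshots in the window are needed to compute a windowed rate; the intermediate ones only matter once the oldest snapshot ages out.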
The new disk_max aggregation reads get_ssd_cache_max_size_bytes which the existing tests left unconfigured, producing a MagicMock that raises on max(MagicMock, int). Also add the missing max_size_bytes key to the expected model payloads.
The paged SSD cache only persists full block_size blocks; the trailing
sub-block tail (e.g. 139 of 256 tokens) is otherwise re-prefilled on
every repeat request. For Kimi K2.5-class models this adds ~1.5-2s
of avoidable TTFT per submission of an identical prompt.
Add a single-slot in-memory stash for that tail. After every
store_cache that produces a trailing partial we keep its KV state under
the parent block's hash. The next admission whose remaining_tokens
start with the stashed tokens splices the partial onto the reconstructed
cache and shrinks remaining_tokens by the partial's length, eliminating
the tail prefill.
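A minimal sketch of the stash-and-apply idea, assuming plain Python lists stand in for token ids and KV tensors. `PartialTail` and `apply_tail` are hypothetical names for illustration, not the functions in this PR.

```python
from dataclasses import dataclass


@dataclass
class PartialTail:
    parent_hash: bytes  # hash of the last full block preceding the tail
    tokens: list        # token ids covered by the stashed tail
    kv: list            # placeholder for the tail's KV state


def apply_tail(stash, parent_hash, cache, remaining_tokens):
    """Return (cache, remaining_tokens, n_applied); inputs are not mutated."""
    if stash is None or stash.parent_hash != parent_hash:
        return cache, remaining_tokens, 0
    n = len(stash.tokens)
    # The remaining prompt must start with exactly the stashed tokens.
    if remaining_tokens[:n] != stash.tokens:
        return cache, remaining_tokens, 0
    # Splice the tail onto the reconstructed cache and shrink the prefill.
    return cache + stash.kv, remaining_tokens[n:], n
```

On a hit the caller prefills only `remaining_tokens[n:]`; on any mismatch it falls back to prefilling the whole tail as before.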
This is a from-scratch rewrite of the archived feat/mru-partial-block-
cache branch (now archive/mru-partial-block-cache-v1). The original
landed three structural bugs that the test suite never exercised:
1. The duck-typed splice gate (hasattr(cache_obj, 'keys') and
hasattr(cache_obj, 'offset')) misclassified RotatingKVCache as
sliceable. RotatingKVCache HAS those attributes, so the gate
would concatenate the full rotating-window state onto the new
request's cache, blowing past max_size and leaving _idx stale.
Hybrid models (Gemma 3, Mistral, anything with sliding window)
would have been silently corrupted on every repeat.
2. The store-side extraction passed is_last_block=True, which makes
_extract_block_tensor_slice return the *full state* (not a token
slice) for non-sliceable layers. Wrong intent for partial
extraction; compounded bug 1 above.
3. The splice's try/except wrapped the whole layer loop, so a
concatenate failure on layer N>0 left layers 0..N-1 already
mutated (offset += n_partial, keys/values overwritten) while
the caller was told zero tokens were applied. Half-mutated
caches are silent generation corruption.
Companion bug in the original deferred-clear suppression: the
suppression had no upper bound, so a hot-prompt workload (each repeat
stashes a fresh MRU before the prior is consumed) could defer the
Metal cache clear forever, defeating the pool-bloat mitigation
(jundot#411).
Safety properties of the rewrite:
- Hybrid refusal. Stash and apply both gate on uniform layer
sliceability via CacheTypeRegistry.get_handler_by_class_name(...)
.supports_block_slicing. If any layer is non-sliceable
(RotatingKVCache, ArraysCache, etc.) the slot is left empty.
Splicing only the sliceable layers in a hybrid would create
per-layer offset skew at decode -- undefined behaviour at the
model level -- so refusal is the only correct policy.
- Transactional splice. apply_mru_partial runs in two phases.
Phase 1 materialises the replacement keys/values for every layer
without touching the cache; phase 2 commits the writes. A
concatenate failure during phase 1 returns (cache, remaining, 0)
with no layer mutated. The slot is evicted on failure so a
consistently-failing partial does not get re-attempted.
- Eviction on every miss kind. Parent-hash mismatch, token
mismatch, length mismatch, layer-count mismatch, splice failure
all clear the slot. A stale or mistargeted partial cannot
survive into a future apply.
- Bounded deferred-clear suppression. Each completion's
_cleanup_finished arms a one-shot
_mru_clear_suppression_available budget alongside the existing
_deferred_clear_at target. At the deadline, if the budget is
intact and the cache reports has_mru_partial(), the deadline is
pushed out by one more _DEFERRED_CLEAR_DELAY window and the
budget is spent. The next deadline fires regardless, bounding
total deferral at 2x _DEFERRED_CLEAR_DELAY (~10-40 ms today).
Patched against _deferred_clear_at, the post-jundot#557 gate -- the
original was patching the obsolete _deferred_clear_steps path.
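The transactional splice property above can be illustrated with a two-phase sketch. Lists stand in for per-layer tensors and `concat` for the concatenation call; the function name and signature are assumptions for the example, not the real apply_mru_partial.

```python
def splice_all_layers(layers, partial, concat):
    """Two-phase splice: stage every layer's result, then commit.

    `layers` and `partial` are lists of per-layer sequences. Returns the
    number of positions applied, or 0 if nothing was changed.
    """
    if len(layers) != len(partial):
        return 0
    staged = []
    try:
        # Phase 1: build every replacement without touching `layers`.
        for current, extra in zip(layers, partial):
            staged.append(concat(current, extra))
    except Exception:
        return 0  # failure on any layer leaves all layers untouched
    # Phase 2: commit. Plain assignments cannot fail part-way through.
    for i, new_value in enumerate(staged):
        layers[i] = new_value
    return len(partial[0]) if partial else 0
```

The point of the split is that every fallible operation happens before the first write, so a failure on layer N cannot leave layers 0..N-1 mutated.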
Tests (25 new, all passing):
TestMRUPartialBlockCache (19) -- init state, stash semantics,
no-stash on block alignment, slot replacement on subsequent
store, parent-hash linkage, hybrid refusal (KVCache +
RotatingKVCache and pure rotating), real round-trip via
store_cache through apply for exact and prefix matches, every
eviction reason, no-op on empty remaining, layer-count mismatch
eviction, transactional rollback under a mocked mx.concatenate
failure on layer 1 (asserts no layer's offset/shape changed),
multi-turn correctness with existing_tokens > 0 and distinct
fill values to verify the right slice was captured.
TestHasMRUPartial (1) -- the public accessor used by the
scheduler reflects slot transitions.
TestMRUDeferredClearSuppression (5) -- budget armed by completion,
clear fires at deadline without MRU, suppressed once with MRU
(deadline pushed by exactly one DELAY), clear fires at second
deadline even if MRU still warm, fires immediately after MRU
eviction, fresh completion refreshes spent budget.
Suite results on this commit:
189 passed (tests/test_prefix_cache.py + tests/test_scheduler.py)
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Adversarial review of the prior commit (a642e04) caught five concrete
holes the rewrite introduced or carried over from its predecessor. Each
is fixed below with a regression test that fails on the prior commit and
passes here.

C1 -- gate replaced registry lookup with explicit whitelist.
_all_layers_sliceable consulted CacheTypeRegistry, whose
get_handler_by_class_name falls through to DefaultCacheHandler for any
class name without a registered handler. DefaultCacheHandler inherits
from KVCacheHandler and reports supports_block_slicing=True. Several
real non-sliceable types are mapped in _class_name_map but have no
registered handler:
- BatchRotatingKVCache (BATCH_ROTATING_KVCACHE enum, no handler)
- BatchPoolingCache, PoolingCache (registered only when the
  deepseek_v4 patch is applied)
The registry would silently classify these as sliceable, recreating
exactly the silent-corruption hazard the rewrite was supposed to close,
just from a different angle. The fix consults the existing
KNOWN_SLICEABLE_CACHE_TYPES whitelist (the same list the rest of the
scheduler trusts for snapshot-skip and partial-extraction decisions),
promoted from a private alias in scheduler.py to a public constant on
omlx.cache.type_registry so both modules share one source of truth.
Test: test_refuse_stash_when_layer_falls_through_to_default_handler
asserts the registry would have lied about BatchRotatingKVCache and the
new gate refuses it.

C2 -- clear() wipes the MRU slot.
BlockAwarePrefixCache.clear() reset _request_tables, _prefix_index, and
the paged cache, but left _mru_partial alive. Scheduler.reset() and
Scheduler._recover_from_cache_error() both route through clear(),
meaning a stale partial would survive exactly the cache-corruption
recovery path that exists *because* something was wrong. After such a
recovery, a future request that happens to reproduce the same prompt
prefix would get its compute_block_hash matching the partial's
parent_hash and the splice would fire against a freshly-reconstructed
cache.
Test: test_clear_wipes_mru_partial.

C3 -- suppression budget arms only on transition from None.
The prior commit's docstring claimed "total deferral bounded at 2x
_DEFERRED_CLEAR_DELAY." False under hot-prompt repeats: every completion
landing while a deferral was pending re-armed
_mru_clear_suppression_available = True. Workloads whose completions
arrive faster than _DEFERRED_CLEAR_DELAY (the very workload this feature
targets) keep refreshing the budget after it's spent, deferring the
clear forever and defeating the pool-bloat mitigation (jundot#411).
Fix: arm the budget only when starting a new deferral epoch
(_deferred_clear_at transitions from None). Subsequent completions in
the same epoch may still extend the deadline (the jundot#557 invariant)
but do not refresh the budget. One suppression per epoch, enforced.
Test: test_completion_within_open_epoch_does_not_refresh_budget drives
two sequential completions, simulates the budget being spent between
them, and asserts the second completion extends the deadline but leaves
the budget at False. The renamed test_new_epoch_completion_arms_budget
pins the converse.

H2 -- global-vs-local indices are now classified, not heuristic'd.
cache_uses_global_indices = (existing_tokens > 0 and
cache_seq_len >= existing_tokens + 1) silently classified ambiguous
lengths as "local." In multi-turn requests where cache_data was
extracted at a boundary equalling the prior turn's length, the cache is
global but cache_seq_len falls between local_len and global_end -- the
old predicate said "local" and the partial was sliced from the prefix
region instead of the trailing tail. parent_hash still matched on the
next request, and a future apply spliced wrong KV. Silent generation
corruption -- exactly the failure class the rewrite was supposed to
close. Replaced with an explicit three-way classification:
  cache_seq_len >= partial_global_end  -> global indices
  cache_seq_len == local_len           -> local indices
  otherwise                            -> refuse to stash
Refusing the ambiguous case is strictly safer than guessing.
Test: test_refuse_stash_on_ambiguous_cache_layout drives the boundary
directly with cache_seq_len strictly between local_len and global_end
and asserts no stash.

H3 -- stash gated on paged_ssd_cache presence.
In paged-SSD-only configurations (the only configuration this class
supports for production reconstruction), reconstruct_cache returns None
when paged_ssd_cache is None, which means apply_mru_partial is
unreachable from the scheduler. Without the gate, the stash held a
multi-MB tensor reference dead in memory until the next store_cache
overwrote it -- wasted memory scaling with model size.
Test: test_no_stash_when_paged_ssd_cache_is_none.

H4 -- accounting divergence documented (no behaviour change).
After a successful splice, cached_tokens is advanced by the partial
length but shared_prefix_blocks is not (the partial is not a stored
paged block). The relaxed invariant
cached_tokens >= shared_prefix_blocks * block_size is now documented at
the scheduler call site, with a guard against the most likely future
misuse (indexing block_table.block_ids by shared_prefix_blocks while
bounding the loop with cached_tokens).

Per-test-class layout:
  TestMRUPartialBlockCache          18 -> 22 (+ C1, C2, H2, H3)
  TestHasMRUPartial                 1
  TestMRUDeferredClearSuppression   6 -> 7 (+ C3 invariant test;
                                    previous "fresh completion" test
                                    renamed to reflect the new contract)

Suite results on this commit:
194 passed in 0.57s
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
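The three-way classification from H2 can be written as a small pure function. This is a sketch only: the enum and function names are hypothetical, while the three conditions mirror the ones listed in the commit message above.

```python
from enum import Enum


class Indexing(Enum):
    GLOBAL = "global"
    LOCAL = "local"
    AMBIGUOUS = "ambiguous"  # caller should refuse to stash


def classify_indexing(cache_seq_len, local_len, partial_global_end):
    """Classify how cache_data is indexed relative to the full prompt."""
    if cache_seq_len >= partial_global_end:
        return Indexing.GLOBAL
    if cache_seq_len == local_len:
        return Indexing.LOCAL
    # Anything strictly between the two lengths cannot be told apart safely.
    return Indexing.AMBIGUOUS
```

Returning an explicit third value, rather than defaulting to "local", is what lets the caller skip the stash instead of slicing from the wrong region.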
Add a docstring explaining how the MRU partial slot's memory cost
flows through the existing memory enforcement machinery, and a test
that pins the invariant the implicit accounting depends on.
Why this is documentation, not behaviour:
Tracing the budgeting paths in this codebase shows that all KV
memory enforcement reads from mx.get_active_memory():
- process_memory_enforcer.py:217,232 (process-level enforcer)
- scheduler.py:1567 (prefill mid-loop limit check)
- scheduler.py:3328 (prefill pre-flight peak check)
- scheduler.py:3377 (generation admission guard)
- scheduler.py:_periodic_clear_threshold_bytes (periodic clear)
- optimizations.py:65 (telemetry)
There is no separate up-front KV budget that the MRU could escape.
In paged-SSD-only mode (the only mode this codebase supports),
_calculate_max_blocks() returns a fixed 100k block-metadata count,
not a memory budget — paged blocks live on SSD, not GPU memory.
The estimator helpers (estimate_block_memory, estimate_prompt_kv_bytes)
are deltas computed against the current mx.get_active_memory()
baseline, which already includes the MRU.
_clone_tensor (prefix_cache.py:1207-1221) uses mx.copy(tensor),
producing real mx.array allocations. MLX counts these in active
memory automatically. So the MRU slot's ~one-block-worth of KV
(~17 MiB Kimi K2.5 / DeepSeek MLA, ~41 MiB Llama 3 70B full
attention) is already enforced against the same limits as the
in-flight request caches.
No behaviour change required. The user's "apples to apples"
intuition holds.
What this commit does add:
1. _MRUPartialBlock docstring gains a "Memory accounting" section
enumerating the enforcement paths and the implicit invariant
(kv_data holds mx.array instances). Future maintainers reading
the MRU code will see why no separate accounting hook exists
and not be tempted to add one.
2. test_kv_data_holds_mlx_arrays_for_active_memory_accounting
asserts the invariant directly. A "helpful" future change that
stored CPU-side copies (np.ndarray to dodge a perceived
GPU-memory cost) would silently escape every existing memory
limit and only manifest as system OOM under load. The test
fails fast if that regression is introduced.
Suite results on this commit:
195 passed in 0.25s (tests/test_prefix_cache.py + tests/test_scheduler.py)
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Sanity-checking the budgeting paths revealed a third memory-budget
layer worth mentioning in the MRU docstring: engine_pool's pre-load
admission gate (engine_pool.py:355-373) reserves a fraction of each
model's weight size as KV headroom and logs "Loading {model_id}
without KV headroom" when eviction can't free enough.
The MRU partial is one tenant of that headroom alongside the in-flight
prompt caches, but it is not separately reserved because at one
block_size of KV per cache instance (~17 MiB Kimi K2.5 / ~41 MiB
Llama 3 70B), the slot is dominated by the concurrent in-flight
caches the headroom was sized for.
Quantification across model classes:
Kimi K2.5 (MLA, ~200 GB quant): 25% headroom = ~50 GB,
MRU = 17 MiB → 0.03%
Llama 3 70B (Q4, ~35 GB): 25% headroom = ~9 GB,
MRU = 41 MiB → 0.5%
Llama 3 8B (Q4, ~4.5 GB): 25% headroom = ~1.1 GB,
MRU = 10 MiB → 1%
Qwen 0.5B (~1 GB): 25% headroom = 256 MiB,
MRU = 5 MiB → 2%
The pre-load layer's granularity (gigabytes) makes the MRU partial
invisible at every model scale. The runtime enforcer catches any
overrun via mx.get_active_memory() regardless. Approach unchanged;
documentation is just more complete.
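The shares in the table can be recomputed with a few lines of arithmetic. The helper below is illustrative only; it assumes binary units (MiB, GiB) and takes the 25% headroom fraction as a parameter, both of which are simplifications made for this sketch.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def mru_share_percent(mru_bytes, weights_bytes, headroom_fraction=0.25):
    """Percentage of the KV headroom that one MRU partial would occupy."""
    headroom_bytes = weights_bytes * headroom_fraction
    return 100.0 * mru_bytes / headroom_bytes


# (MRU size, model weight size) pairs taken from the table above.
MODELS = {
    "Kimi K2.5": (17 * MIB, 200 * GIB),
    "Llama 3 70B": (41 * MIB, 35 * GIB),
    "Llama 3 8B": (10 * MIB, 4.5 * GIB),
    "Qwen 0.5B": (5 * MIB, 1 * GIB),
}

for name, (mru, weights) in MODELS.items():
    print(f"{name}: {mru_share_percent(mru, weights):.3f}% of headroom")
```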
The 25% percentage itself is intentionally not quoted in the
docstring — it could change in engine_pool without invalidating the
MRU's accounting model.
Suite results on this commit:
195 passed in 0.23s
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Three cleanups surfaced by the simplify review. No behaviour change.
Doc bit-rot: drop file:line references from comments.
The MRU stack picked up several docstring/comment references that
cite specific scheduler.py line numbers and a hardcoded "100k"
block-metadata count. Both will rot — line numbers shift on any
edit above them, and the 100k constant lives in
_calculate_max_blocks() and could change without invalidating the
MRU's accounting model.
- _MRUPartialBlock docstring: enumerated scheduler.py:1567, 3328,
3377, _periodic_clear_threshold_bytes. Replaced with prose
naming the same gates symbolically. Dropped the "100k" magic
number; the doc now says "fixed block-metadata count" since the
specific number is irrelevant to the MRU's invariant.
- H4 accounting note in scheduler.py: referenced "logging at 2828,
2835" inside source. Replaced with prose ("the scheduler's
prefill-completion log lines downstream") so a future search by
log message still finds them after motion.
Test factory collapse:
tests/test_prefix_cache.py: _kv_layer and _rotating_layer were
near-identical (one differed only in the cache_type/class_name
string and a fill kwarg). Extracted shared _layer factory taking
class_name as a kwarg; the two existing helpers now delegate to
it. Keeps the call sites readable while removing the copy-paste.
Same factory naturally extends to BatchRotatingKVCache and other
cache types when those tests grow.
Findings deferred (separate PRs warranted):
- Per-block loop in store_cache (prefix_cache.py:553-555) shares
the cache_seq_len >= existing_tokens + 1 heuristic the H2 fix
retired in _update_mru_partial. Extracting a shared
_classify_cache_indexing helper and routing the per-block loop
through it would close the same hazard at its other site.
Needs its own safety analysis (when does the
cache_seq_len == existing_tokens boundary actually arise) and
regression tests scoped to the per-block path; out of scope for
the MRU branch.
- paged_cache.allocated_blocks.get(...) is a leaky abstraction at
9+ pre-existing sites in prefix_cache.py. Encapsulating it
behind a public PagedCacheManager.get_block_by_id() method is a
wider refactor that should not piggyback on MRU work.
- KNOWN_SLICEABLE_CACHE_TYPES → CacheType enum has a TurboQuant
caveat (_class_name_map collapses TurboQuantKVCache and
BatchTurboQuantKVCache to KVCACHE). Conversion needs a
deliberate decision on whether to lose the explicit gate
strings.
- Per-layer mx.concatenate dispatch in apply_mru_partial's phase 1
could potentially be batched. Per the prefill-perf principle,
this needs an M3 Ultra measurement before changing — both to
establish that the cost is significant and to confirm batched
dispatch is actually faster on the platform.
Suite results on this commit:
195 passed in 0.42s
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Follow-up to the MRU partial cache peer review. The H3 stash gate at
prefix_cache.py:806 and the canonical reconstruct guard at :1710 both
check ``self.paged_ssd_cache is None`` to decide whether reconstruct
can possibly return non-None. Co-locating the two predicates as a
single ``_can_reconstruct`` helper makes the lockstep explicit — a
future fetch path that bypasses PagedSSDCacheManager (alternate
backends, memory-only modes that detach from the manager) updates
exactly one predicate, not two.
Behaviour is unchanged. The clarification is in the docstring.
The MRU stash docstring previously said the slot is cleared when
"no SSD configured." That phrasing is misleading on
``hot_cache_only=True`` configurations (set via settings.json or
OMLX_HOT_CACHE_ONLY env): the manager IS present in that mode — only
the disk writer thread and directory init are skipped. The reconstruct
path still works because PagedSSDCacheManager.load_block_with_metadata
short-circuits to the hot tier without ever calling mx.load. In that
mode the MRU stash IS expected to populate, and the gate correctly
permits it.
The gate fires only when no PagedSSDCacheManager instance exists at
all — typically a test/dev scenario. The new docstring on
``_can_reconstruct`` enumerates the predicate's semantics and the
hot_cache_only case explicitly so a reader looking at either site
arrives at the same understanding.
Test changes:
- ``test_no_stash_when_paged_ssd_cache_is_none`` keeps its name and
behaviour (covers the no-manager case) but its docstring now
distinguishes that case from ``hot_cache_only=True``, where the
MRU IS expected to populate.
- New ``test_can_reconstruct_helper_reflects_manager_presence``
pins the predicate's contract directly.
Suite results on this commit:
197 passed in 0.38s
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Upgrade the MRU partial-block cache from a single slot to a bounded LRU
dict keyed by parent_hash. Multiple concurrent "warm" partials coexist
up to ``mru_partial_max_entries``; LRU eviction pops the oldest when
capacity is reached.

Single-slot mode could not absorb interleaving (multi-user /
multi-conversation workloads); every ``store_cache`` overwrote the lone
slot. On M3 Ultra -- where prefill is firmly compute-bound at ~26 TFLOPS
effective FP16 with no MatMul accelerator (see
``feedback_apple_silicon_perf.md``) -- leaving prefill compute on the
table for interleaved workloads is exactly the case that should not be
deferred pending metrics.

Data structure
--------------
``BlockAwarePrefixCache._mru_partials: OrderedDict[bytes | None, _MRUPartialBlock]``
mirrors the LRU pattern from ``PagedSSDCacheManager._hot_cache``. No
internal lock -- the prefix cache relies on the scheduler's
single-threaded executor model (same as today's single slot).

Public API stays compatible. ``has_mru_partial()`` returns
``bool(self._mru_partials)``; the scheduler's deferred-clear suppression
budget reads the same boolean predicate it did before.
``apply_mru_partial()`` and ``_update_mru_partial()`` retain their
signatures.

Eviction discipline
-------------------
- On stash: if key exists, pop and re-insert at tail; if over capacity,
  ``popitem(last=False)``.
- On apply success: ``move_to_end(key)`` -- promote to LRU tail.
- On apply miss for a found key (token-prefix, layer-count, or splice
  failure): ``pop(key, None)`` -- evict only that key.
- On ``clear()`` (cache-corruption recovery): wipe the dict.
- "No eligible tail" branches in ``_update_mru_partial`` no longer wipe
  the dict -- they bare-return. A local "nothing to stash this time"
  signal is unrelated to the validity of other entries. This is the
  behavioural change that lets distinct-prefix entries coexist under
  interleaving.

Freed-paged-block guard (new)
-----------------------------
If ``block_table.block_ids`` is non-empty but the parent paged block has
been freed between stash and apply, ``apply_mru_partial`` returns no-op
rather than falling through to a ``None``-keyed dict lookup. That
fall-through would falsely match a short-prompt entry against a request
whose parent is just gone. The race is structurally new in multi-slot
mode; single-slot tolerated it because there was only ever one entry to
match against.

Short-prompt entries (prefix < block_size, parent_hash=None) share one
slot via the ``None`` key -- same multi-tenant constraint as the
single-slot design, but only for the short-prompt subset.

Capacity & plumbing
-------------------
``mru_partial_max_entries`` threads from ``CacheSettings`` →
``--mru-partial-max-entries`` CLI flag → ``SchedulerConfig`` → both
``BlockAwarePrefixCache(...)`` construction sites (main at
``scheduler.py:804`` and SpecPrefill draft at ``scheduler.py:3473``).
Default 4 matches the dflash ``max_entries`` precedent (PR jundot#1120).
``0`` disables stashing (silent fallback to "no MRU" behaviour,
mirroring the ``hot_cache_max_size="0"`` convention).

Memory worst-case at default 4: ~68 MiB MLA / ~165 MiB GQA per cache
instance. With two cache instances (main + SpecPrefill draft),
~136 MiB / ~330 MiB total. All inside the engine pool's 25% KV headroom
envelope. Documented in the ``_MRUPartialBlock`` docstring including the
``hot_cache_only=True`` coexistence note: the hot cache and MRU dict
both live in the same envelope under that mode and should be tuned
together.

Test surface
------------
Existing single-slot tests adapted via a test-only ``_get_mru_partial``
helper (production class surface stays clean; tests are decoupled from
the internal container shape).

New ``TestMRUPartialMultiSlot`` covers the multi-slot mechanics:
- ``test_distinct_prefixes_coexist_as_separate_entries``
- parameterized ``test_lru_capacity_bounds``
  (evict-oldest-at-capacity + under-capacity-keeps-all)
- ``test_apply_success_promotes_entry_to_lru_tail``
- ``test_max_entries_zero_disables_stashing``
- ``test_clear_mru_partials_wipes_only_partials``
- ``test_apply_noop_when_parent_block_freed`` (the new guard)
- ``test_short_prompt_none_key_coexists_with_block_aligned_entry``

Existing tests adapted, with a few semantically inverted:
- ``test_stash_replaced_on_subsequent_store`` →
  ``test_same_prefix_store_replaces_entry``: same prefix → same key →
  correct LRU put behaviour (replace).
- ``test_stash_clears_when_subsequent_store_is_block_aligned`` →
  ``test_no_eligible_tail_does_not_evict_siblings``: the inverse
  behavioural change from single-slot.
- ``test_apply_evicts_on_parent_hash_mismatch`` →
  ``test_apply_noop_on_parent_hash_mismatch_preserves_sibling``: no-op +
  sibling preservation, not eviction.
- ``test_clear_wipes_mru_partial`` → ``test_clear_wipes_mru_partials``.

Suite results on this commit:
211 passed: cache + scheduler + admin clear-symmetry tests
9 passed: test_settings.py::TestCacheSettings
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
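The eviction discipline above maps closely onto `collections.OrderedDict`. The class below is a minimal sketch under that assumption; `PartialLRU` and its method names are hypothetical and omit the token, layer-count, and freed-block checks the real cache performs.

```python
from collections import OrderedDict


class PartialLRU:
    """Bounded LRU of partials keyed by parent hash (illustrative only)."""

    def __init__(self, max_entries=4):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def stash(self, parent_hash, partial):
        if self.max_entries <= 0:
            return  # a capacity of 0 disables stashing entirely
        # Same key: drop the old entry so the new one lands at the tail.
        self._entries.pop(parent_hash, None)
        self._entries[parent_hash] = partial
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def hit(self, parent_hash):
        partial = self._entries.get(parent_hash)
        if partial is not None:
            self._entries.move_to_end(parent_hash)  # promote on success
        return partial

    def evict(self, parent_hash):
        # Evict only the entry that missed; siblings are untouched.
        self._entries.pop(parent_hash, None)

    def __len__(self):
        return len(self._entries)
```

Note that `None` is a valid dictionary key here, which is how a single shared slot for short prompts falls out of the same structure.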
The ``/api/ssd-cache/clear`` admin endpoint at
``omlx/admin/routes.py:3975`` wipes the SSD-backed paged blocks per
loaded scheduler but did not touch
``BlockAwarePrefixCache._mru_partials``. Surviving MRU partials then
chain from paged-block hashes whose KV bytes were just flushed,
violating the operator's "drop all warm caches" intent. Under
``hot_cache_only=True`` (where the hot tier IS the only persistent
store) the same hazard would also apply to PR jundot#1183's forthcoming
``/api/hot-cache/clear`` endpoint when it lands; that wiring is a
one-line follow-up at the same loop using the same method.

Wire ``block_aware_cache.clear_mru_partials()`` into the per-scheduler
loop alongside ``ssd_manager.clear()``, with the same defensive
try/except wrapper. A failure clearing one scheduler's MRU does not
prevent siblings from being cleared.

The standalone ``clear_mru_partials()`` method (added in the previous
commit, see ``omlx/cache/prefix_cache.py``) is the public seam. Its own
unit coverage lives in
``TestMRUPartialMultiSlot::test_clear_mru_partials_wipes_only_partials``;
this commit adds two endpoint-level tests in
``TestClearSSDCacheAlsoWipesMRUPartials`` that pin the wiring:
- ``test_endpoint_calls_clear_mru_partials_on_each_scheduler`` confirms
  both ``ssd_manager.clear()`` and
  ``block_aware_cache.clear_mru_partials()`` fire for every loaded
  scheduler.
- ``test_mru_clear_failure_does_not_block_other_scheduler`` pins the
  defensive try/except: an exception in one scheduler's clear path must
  not stop the loop.

Suite results on this commit:
62 passed: tests/test_admin_api_key.py (1 pre-existing unrelated failure remains)
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
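The per-scheduler wiring can be pictured as the loop below. It is a sketch under assumptions: the function name `clear_all` and the attribute layout (`ssd_manager`, `block_aware_cache` directly on each scheduler object) are simplified for illustration and may not match the actual route handler.

```python
import logging

logger = logging.getLogger(__name__)


def clear_all(schedulers):
    """Clear SSD blocks and MRU partials per scheduler; return failure count."""
    failures = 0
    for sched in schedulers:
        steps = (
            ("ssd", sched.ssd_manager.clear),
            ("mru", sched.block_aware_cache.clear_mru_partials),
        )
        for label, clear in steps:
            # Each step gets its own try/except so one failure cannot
            # stop the remaining steps or the remaining schedulers.
            try:
                clear()
            except Exception:
                failures += 1
                logger.exception("failed to clear %s cache", label)
    return failures
```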
a0f24fb to cce198f
Plug the multi-slot MRU partial cache into the same observability
surface that jundot#1183 established for the prefix and hot/disk tiers.
Without these counters, operators have no signal for whether
``--mru-partial-max-entries`` is tuned right; with them, the dashboard
answers "is the MRU paying off" with the same shape it answers for
memory and disk hit rates.

Counters (cumulative, on PrefixCacheStats)
------------------------------------------
- ``mru_partial_stashes`` -- every successful entry write, including
  same-key replacements. Same-key replacement does NOT count as an
  eviction; operator sees stash payoff via the hits/stashes ratio.
- ``mru_partial_hits`` -- every successful splice via
  apply_mru_partial.
- ``mru_partial_evictions`` -- total entries removed. Includes
  capacity-overflow LRU evictions, apply-time mismatch pops (token,
  layer-count, splice failure), and ``clear_mru_partials()`` wipes. Does
  NOT include full ``clear()`` (cache-corruption recovery) -- that path
  also calls ``reset_stats()``, so incrementing evictions there would be
  incoherent (the increment gets zeroed immediately). Operators tracking
  partial-only wipes use ``clear_mru_partials()``.
- ``mru_partial_tokens_saved`` -- sum of ``n_partial`` across hits. The
  direct compute-saved measure: each unit is one token of prefill
  forward-pass that did NOT have to run.

Gauges (live state, on PrefixCacheStats)
----------------------------------------
- ``mru_partial_entries`` -- current dict length.
- ``mru_partial_max_entries`` -- configured capacity (operator-facing).

Plumbing
--------
Counters thread through ``Scheduler._collect_cache_counters`` into the
existing ``CacheRateTracker``. ``observability._compute_window`` and
``_compute_cumulative`` add per-counter deltas plus a derived
``mru_partial_hit_rate`` (= hits / stashes, with the same
zero-stashes-no-NaN guard the other ratios use). The admin
``_build_runtime_cache_observability`` emits per-model entries/
max_entries gauges and aggregates them at the payload level the same way
``hot_cache_entries`` and ``hot_cache_max_bytes`` do.

Dashboard
---------
Mirrors jundot#1183's hot-cache surface:
- **Header gauge** "MRU tails N/M entries" next to the Memory and SSD
  gauges, visible only when ``mru_partial_max_entries > 0``.
- **Rate strip** gains "MRU Tail Hit Rate" and "MRU Tokens Saved" cells.
  Grid expands from 4 cells (hot cache only) to 6 cells (both tiers).
  When only one or the other tier is enabled, the layout stays at 4
  cells with the disabled tier's cells hidden via ``x-show``.
- **Per-model table** gains an "MRU Tails" column showing
  ``entries / max_entries`` for each loaded model.

Test coverage (10 new cases)
----------------------------
``TestMRUPartialCounters`` (8 cases) -- initial zeros, stash-counter
bumps, same-key-replacement-is-not-eviction, capacity-overflow eviction
count, apply-success bumps hits+tokens_saved, apply-miss eviction,
``clear_mru_partials()`` bulk eviction count, ``clear()`` zeros
everything semantics, ``reset_stats()`` zeros cumulative counters but
preserves live entries.

``TestCacheRateTrackerRates`` (3 new cases) -- mru_partial_hit_rate
windowed + cumulative, zero-stashes no-NaN guard, tokens_saved delta
accumulation.

``TestRuntimeCacheObservability`` updated to reflect the two new
per-model payload keys (``mru_partial_entries``,
``mru_partial_max_entries``).

Suite results on this commit:
250 passed: cache + scheduler + observability + admin + settings
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
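The derived ``mru_partial_hit_rate`` and its zero-stashes guard might look like the sketch below. The function name is hypothetical, and returning 0.0 when there were no stashes is an assumption of this example; the commit only states that the guard avoids NaN.

```python
def mru_window_stats(start, end):
    """Derive windowed MRU stats from two cumulative counter snapshots."""
    stashes = end["mru_partial_stashes"] - start["mru_partial_stashes"]
    hits = end["mru_partial_hits"] - start["mru_partial_hits"]
    saved = end["mru_partial_tokens_saved"] - start["mru_partial_tokens_saved"]
    return {
        # Guard: with zero stashes in the window, report 0.0 instead of
        # dividing by zero (assumed fallback value for this sketch).
        "mru_partial_hit_rate": hits / stashes if stashes > 0 else 0.0,
        "mru_partial_tokens_saved": saved,
    }
```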
…s, Alpine getters

Five small simplifications surfaced by a three-agent code review on the
MRU partial cache commit stack. No behaviour change.

**Hoist test helpers to module level.** Three MRU test classes
(``TestMRUPartialBlockCache``, ``TestMRUPartialMultiSlot``,
``TestMRUPartialCounters``) each redefined ``_layer``, ``_kv_layer``,
``_rotating_layer``, ``_make_reconstructed_cache``,
``_stash_with_prefix``, and ``_cache`` (factory). Hoist to module-level
helpers in ``tests/test_prefix_cache.py`` (alongside the existing
``_get_mru_partial`` accessor); the duplicate methods come out, ~120
lines of repetition collapses, call sites switch from
``self._kv_layer(...)`` to ``_kv_layer(...)``. The factory for custom
capacity is renamed
``_make_mru_cache(paged_cache, mock_ssd, max_entries, num_layers)``.
Per-class fixtures (``mx``, ``paged_cache``, ``mock_ssd``) stay
class-local to avoid leaking fixture names into unrelated test classes
in the same module.

**Extract ``_evict_miss`` helper in ``apply_mru_partial``.** Five arms
of ``self._mru_partials.pop(last_hash, None);
self._mru_partial_evictions += 1; return cache, remaining_tokens, 0``
collapse into a single inner function. Each call site is now one line,
the eviction-counter bookkeeping lives in one place, and the rollback
contract is harder to break by accident.

**Dashboard Alpine getters.** Three getters added to the dashboard root:
``mruEnabled``, ``hotCacheEnabled``, ``cacheRatesGridCols``. The
previous expressions repeated
``stats.runtime_cache?.mru_partial_max_entries > 0`` in 8 places and a
three-arm ternary chain in the rate-strip grid-class binding. The HTML
now reads
``x-show="mruEnabled && stats.runtime_cache?.models?.length > 0"`` and
``:class="cacheRatesGridCols"`` at the relevant sites.

**``_make_counters`` driven by a key tuple.** Previously took 16
explicit kwargs and re-listed every key in the returned dict body. Now:
a module-level ``_COUNTER_KEYS`` tuple is the single source of truth;
the helper builds a zero-initialised dict and applies ``**overrides``.
Unknown keys raise (catches the typos the explicit signature used to
catch). Adding a new observability counter is now one tuple entry
instead of three coordinated changes (signature, dict body, and any call
sites that wanted the default).

**Prune ``_MRUPartialBlock`` docstring rot-prone bits.** Dropped
specific MiB numbers (~17/41/68-165 MiB) and the PR# reference to
jundot#1120 from the memory-accounting section. The numbers were
informative when written but would have aged past the model and config
landscape they assumed. Kept the invariant statement and the test
reference; removed the calibration table.

**Findings reviewed and skipped:**
- Aggregating ``mru_partial_max_entries`` across loaded models was
  flagged as "wrong arithmetic" -- actually correct, matches the
  deliberate hot-cache convention from jundot#1183 (each model has its
  own budget, the dashboard gauge shows fleet fill = sum of entries /
  sum of capacities).
- ``_get_cache_seq_len`` per-block redundancy in ``store_cache`` --
  pre-existing pattern, not regressed by this stack, defer.
- Phase-1 ``mx.concatenate`` × N layers × 2 dispatch shape -- defer
  pending M3 Ultra measurement (per ``feedback_apple_silicon_perf.md``
  memory: never estimate prefill costs).
- ``paged_cache.allocated_blocks.get(...)`` direct dict access at 9+
  pre-existing sites -- wider refactor, out of scope.
- ``_all_layers_sliceable`` vs ``_prompt_cache_needs_snapshots``
  co-location -- different inputs (class-name strings vs live cache
  objects), unification would be scope-creep.

Suite results on this commit:
250 passed: cache + scheduler + observability + admin + settings.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The PR has been rescoped to function alongside the hot cache in addition to the SSD cache. To avoid commit churn, I rebased on ivaniguarans's #1183 and used it to surface dashboard metrics for the LRU cache. Doing some additional local testing, then marking this as ready for review. (gated by merge of #1183)
The admin dashboard reads MRU partial cache state via ``Scheduler.get_ssd_cache_stats`` -> ``BlockAwarePrefixCache.get_stats_dict``, which silently dropped every MRU field added by the observability-counters commit. With ``mru_partial_max_entries`` aggregating to 0 in the admin payload, the dashboard's ``mruEnabled`` gate stayed false and hid every panel (header gauge, rate-strip cells, and the per-model "MRU Tails" column) even when operators had the feature configured.

The counter delta path was unaffected: ``Scheduler._collect_cache_counters`` reads the ``PrefixCacheStats`` dataclass via ``get_stats()``, so ``cache_rates.cumulative`` already carried ``mru_partial_hit_rate`` and ``mru_partial_tokens_saved``. The dashboard simply refused to render those cells because the gauge gate was false.

Adds six fields to ``get_stats_dict()``:

- ``mru_partial_stashes`` (counter)
- ``mru_partial_hits`` (counter)
- ``mru_partial_evictions`` (counter)
- ``mru_partial_tokens_saved`` (counter)
- ``mru_partial_entries`` (gauge, len(_mru_partials))
- ``mru_partial_max_entries`` (gauge, configured capacity)

Regression test
---------------
Adds ``TestMRUPartialCounters::test_get_stats_dict_mirrors_dataclass_after_round_trip``, following the Pattern B mandate the class docstring already established (real ``store_cache`` round-trip rather than hand-built ``_MRUPartialBlock`` state). Exercises three stashes plus one successful apply against ``max_entries=2`` so every counter and gauge moves off zero, then asserts each MRU field on ``get_stats_dict()`` matches the corresponding field on the ``PrefixCacheStats`` dataclass.

Sibling MRU counter tests already covered each counter individually via ``get_stats()`` (the dataclass), which is why the missing-keys regression slipped past them: none asserted on the dict surface. This test closes that loop and was verified to fail cleanly without the fix (``mru_partial_stashes missing from get_stats_dict()``).

Suite: 139 passed (cache + observability).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
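The dict-mirrors-dataclass assertion in the regression test can be sketched as a field-by-field comparison. ``MRUStatsSketch`` and ``dict_mirror_problems`` are hypothetical names; the dataclass below carries only the six MRU fields from the list above and is not the real ``PrefixCacheStats``.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List


@dataclass
class MRUStatsSketch:
    """Hypothetical stand-in holding only the six MRU fields."""

    mru_partial_stashes: int = 0
    mru_partial_hits: int = 0
    mru_partial_evictions: int = 0
    mru_partial_tokens_saved: int = 0
    mru_partial_entries: int = 0
    mru_partial_max_entries: int = 0


def dict_mirror_problems(
    stats: MRUStatsSketch, stats_dict: Dict[str, Any]
) -> List[str]:
    """Return one message per field the dict drops or reports differently."""
    problems: List[str] = []
    for f in fields(stats):
        expected = getattr(stats, f.name)
        if f.name not in stats_dict:
            problems.append(f"{f.name} missing from get_stats_dict()")
        elif stats_dict[f.name] != expected:
            problems.append(
                f"{f.name}: dict has {stats_dict[f.name]!r}, "
                f"dataclass has {expected!r}"
            )
    return problems
```

Iterating over the dataclass fields, rather than over a hand-written key list, means a newly added counter is checked against the dict surface without touching the test.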
The MRU partial-block stash safety gate refuses any layer set containing a non-sliceable cache type (``RotatingKVCache``, ``PoolingCache``, ``ArraysCache``, ``CacheList``, etc.): splicing a partial into a sliceable subset only would cause per-layer offset skew at decode (silent generation corruption). For affected models the entire MRU feature is structurally unavailable, but the dashboard previously rendered a misleading "0/N entries" gauge that left operators puzzling over an apparent config bug.

Concrete case: DeepSeek-V4-Flash (every layer is ``CacheList(RotatingKVCache, PoolingCache, PoolingCache)``).

Detection
---------
Adds a tri-state ``mru_partial_supported`` flag to ``BlockAwarePrefixCache`` and ``PrefixCacheStats``:

- ``None`` → unknown (no introspection has resolved it yet)
- ``True`` → every observed layer is sliceable
- ``False`` → at least one non-sliceable layer observed; every future stash attempt is refused at the safety gate

Two detection paths feed the flag:

1. **Eager (load time):** ``_check_mru_eligibility_at_init`` calls ``model.make_cache()`` once at construction, extracts type names via ``ModelCacheConfig.from_cache_list``, and resolves the flag immediately. Best-effort: if make_cache is absent or raises, falls back to lazy detection without crashing. Cache instances are dropped via ``del`` after inspection; no tensor buffers are allocated (those arrive on first prefill), and Python wrappers are GC-reclaimed.
2. **Lazy (first inference):** ``_update_mru_partial`` checks ``_all_layers_sliceable(layer_cache_types)`` on each call. First non-sliceable observation latches the flag and emits the warning; first sliceable observation latches True.

The warning fires exactly once per cache instance via ``_mru_partial_warn_emitted``. Operator log message includes the offending types and the sliceable whitelist so it's grep-actionable:

    WARNING omlx.cache.prefix_cache: MRU tail cache disabled for this model: layer types ['RotatingKVCache', 'PoolingCache'] are not in the sliceable whitelist [...]. Splicing a partial into a non-sliceable subset would cause per-layer offset skew at decode (silent generation corruption), so every stash attempt will be refused. The admin dashboard's per-model 'MRU Tails' cell will display 'N/A (see log)'.

Dashboard surface
-----------------
Per-model "MRU Tails" cell renders ``N/A (see log)`` when ``mru_partial_supported === false``; otherwise renders ``entries / max_entries`` as before. Hover tooltip references the server log.

Global rate-strip cells (MRU Tail Hit Rate, MRU Tokens Saved) stay as aggregates: if any loaded model is compatible, those cells still surface its payoff.

Tests (8 new in TestMRUPartialEligibility)
------------------------------------------
- ``supported_is_none_without_make_cache_and_no_inference``
- ``supported_latches_true_on_sliceable_observation``
- ``supported_latches_false_lazy_on_non_sliceable`` (round-trip via ``store_cache`` with ``_rotating_layer`` factory)
- ``warning_does_not_repeat_on_subsequent_non_sliceable``
- ``eager_check_latches_false_at_init_with_non_sliceable_make_cache``
- ``eager_check_latches_true_at_init_with_sliceable_make_cache``
- ``eager_check_skipped_when_feature_disabled``
- ``eager_check_survives_make_cache_failure``

Existing ``TestRuntimeCacheObservability::test_runtime_cache_uses_model_scoped_ssd_stats`` updated to include the new per-model payload key (``mru_partial_supported: None`` for mocks that don't populate ``prefix_cache``).

Suite: 215 passed (cache + observability + admin).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
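The tri-state latch and warn-once behaviour described under "Detection" can be sketched as follows. ``MRUEligibilitySketch``, the ``SLICEABLE`` set, and the log text are all illustrative assumptions; the real whitelist and flag live on ``BlockAwarePrefixCache``.

```python
import logging
from typing import Iterable, Optional

logger = logging.getLogger("mru_eligibility_sketch")

# Assumed whitelist for illustration; the real one is defined elsewhere.
SLICEABLE = frozenset({"KVCache"})


class MRUEligibilitySketch:
    def __init__(self) -> None:
        self.supported: Optional[bool] = None  # None = not yet resolved
        self.warnings_emitted = 0

    def observe(self, layer_types: Iterable[str]) -> bool:
        """Record one observation; return True if stashing is allowed."""
        if self.supported is False:
            return False  # latched: every later attempt is refused
        offenders = sorted({t for t in layer_types if t not in SLICEABLE})
        if offenders:
            self.supported = False
            if self.warnings_emitted == 0:
                # Fires once per instance, however many refusals follow.
                self.warnings_emitted += 1
                logger.warning(
                    "MRU tail cache unavailable; layer types: %s",
                    ", ".join(offenders),
                )
            return False
        self.supported = True
        return True
```

Both detection paths can feed the same ``observe`` call: the eager path passes the type names from a throwaway ``make_cache()`` result, the lazy path passes the types seen on the first stash attempt.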
Pulls in 8 upstream commits, most relevantly:

- 386e16f fix(tests): repair pre-existing upstream test failures and import guards (jundot#1244): restores list-shaped GitHub releases payload in test_admin_update_check / test_admin_auth fixtures. Was committed upstream 2026-05-14 10:21, after this branch's previous merge of main and before the next. Branch was unknowingly running with these 4 tests failing the entire time.
- 4fe004d feat: add Hermes Agent quick launch (jundot#1250)
- ccfba1d fix(load): VLM model loading fixes for oQ-quantized checkpoints (jundot#1247)
- 51907f0 fix(oq): restore MTP head attach for VLM sensitivity
- and others

Without jundot#1244 we keep inheriting the broken admin-auth / update-check tests as branch-only baseline failures. The fix landed 8 hours before today's MRU work and was never picked up because the branch hadn't merged main since.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The model-load warning added in 5848c49 read like a developer note: it dumped the internal sliceable whitelist, explained the splice mechanism and the offset-skew failure mode, raised a "silent generation corruption" alarm, and narrated which dashboard cell would change.

Rewrite it in the in-tree load-phase warning voice ("condition + consequence, plain words"), matching the ``mtp_enabled`` warning in ``utils/model_loading.py`` and the L2 warning in ``engine/dflash.py``:

    MRU tail cache enabled but this model is incompatible (cache layers: RotatingKVCache, PoolingCache); MRU tails will be inactive for this model.

- Drop the whitelist dump: operators don't tune against it.
- Drop the splice-mechanism rationale: that's an engineering explanation, not an operator decision point. It still lives in the ``_all_layers_sliceable`` docstring where developers read it; ``_record_mru_unsupported``'s docstring now points there.
- Drop the dashboard self-reference: the dashboard already renders 'N/A (see log)'; the log shouldn't narrate the UI.
- ``", ".join(offenders)`` instead of raw list repr, and ``"unknown"`` instead of ``"<unknown>"`` for the fallback.

Tests assert on the new wording ("MRU tails will be inactive", "incompatible").

Would have folded into 5848c49 as an amend, but the merge of main now sits between that commit and HEAD.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
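The reworded message, with the joined offender list and the ``"unknown"`` fallback, could be assembled as below. ``format_mru_incompatible_warning`` is a hypothetical helper name used only for this sketch; the commit does not say how the string is built internally.

```python
from typing import Sequence


def format_mru_incompatible_warning(offenders: Sequence[str]) -> str:
    """Condition + consequence in plain words, with no internals dumped."""
    # Join names rather than printing a raw list repr; fall back to
    # "unknown" when no offending types could be identified.
    layers = ", ".join(offenders) if offenders else "unknown"
    return (
        "MRU tail cache enabled but this model is incompatible "
        f"(cache layers: {layers}); "
        "MRU tails will be inactive for this model."
    )
```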
The MRU partial-block cache stashed the trailing partial of the stored sequence (``prompt + output``) keyed by that sequence's last full block, but a repeat request resubmits the *prompt* only and ``apply_mru_partial`` looks the entry up by the prompt's last full block. Output tokens shift every block boundary past the prompt's tail, so the two keys never coincide: ``_mru_partials.get(last_hash)`` always returned None.

The feature produced zero hits for ordinary chat completions, observed live as "MRU tails 3/4 entries, MRU Tail Hit Rate 0.0%" with zero evictions (a key that is never found is never evicted). It only worked for reasoning models whose prompt ends with an open ``<think>`` tag, where ``needs_think_prefix`` makes ``store_cache`` persist the prompt alone.

Fix
---
Thread the prompt token count from the scheduler into ``store_cache`` -> ``_update_mru_partial``. The stash now keys off the prompt's last full block (block index ``prompt_len // block_size - 1``) and slices the prompt's trailing partial, not the stored sequence's. That block's hash is identical whether the sequence is blocked as ``prompt`` or ``prompt + output`` (block hashes are content-chained and the chain is byte-identical up to the prompt's partial tail), so the key the stash writes is exactly the key a prompt-only resubmission's ``apply_mru_partial`` computes.

The arithmetic runs in the existing global-coordinate frame and accounts for ``existing_tokens > 0`` (on a resubmission ``store_cache`` appends to the fetched prefix block table and works in ``new_tokens`` space): ``partial_start = prompt_partial_start - existing_tokens``, with a guard for a prompt already fully covered by cached full blocks.

Edge cases degrade cleanly:

- block-aligned prompt: no stash;
- prompt shorter than one block: ``None`` key (short-prompt path);
- ``prompt_token_count=None`` (generic ``CacheManager.store`` path): falls back to the whole stored sequence, reproducing the pre-fix behavior for verbatim-repeat callers.

``apply_mru_partial`` is unchanged; only the stash side was wrong.

Draft cache
-----------
``_draft_prefix_cache`` (SpecPrefill) is constructed with ``mru_partial_max_entries=0``. ``apply_mru_partial`` is only ever called on the main ``block_aware_cache``, so a draft-cache stash was dead work that never paid off.

Tests
-----
New ``TestMRUPromptBoundaryStash`` (5 cases):

- ``prompt_boundary_stash_hits_on_prompt_only_resubmit``: the round-trip the original feature shipped without: store ``prompt + output`` with the boundary, resubmit prompt-only, assert a hit.
- ``whole_sequence_stash_misses_on_prompt_only_resubmit``: pins the original bug via the ``prompt_token_count=None`` path (0 hits, 0 evictions).
- ``block_aligned_prompt_does_not_stash``
- ``short_prompt_stashes_under_none_key``
- ``prompt_boundary_stash_with_existing_cached_prefix``: the ``existing_tokens > 0`` resubmission path.

Diagnosis and approach were adversarially peer-reviewed; the review caught an ``existing_tokens``-relative indexing error in the first draft of the fix, corrected here.

Full unit suite: 4401 passed (5 pre-existing upstream baseline failures unrelated to cache).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
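The key arithmetic in the fix can be checked with a toy model of content-chained block hashes. This is a sketch under stated assumptions: ``chain_block_hashes`` and ``prompt_boundary_stash`` are hypothetical helpers, SHA-256 over token ids stands in for whatever hashing the paged cache really uses, and the ``existing_tokens`` offset is left out for brevity.

```python
import hashlib
from typing import List, Optional, Sequence, Tuple


def chain_block_hashes(tokens: Sequence[int], block_size: int) -> List[str]:
    """Hash each full block together with its parent's hash."""
    hashes: List[str] = []
    parent = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def prompt_boundary_stash(
    stored: Sequence[int], prompt_len: int, block_size: int
) -> Optional[Tuple[Optional[str], List[int]]]:
    """Return (key, prompt tail) to stash, or None if there is no tail."""
    full_blocks = prompt_len // block_size
    tail = list(stored[full_blocks * block_size:prompt_len])
    if not tail:
        return None  # block-aligned prompt: nothing to stash
    if full_blocks == 0:
        return None, tail  # short prompt: stash under the None key
    # Index of the prompt's last full block within the stored sequence.
    key = chain_block_hashes(stored, block_size)[full_blocks - 1]
    return key, tail
```

With ``block_size=4``, a 10-token prompt, and 5 output tokens, the stash key is the hash of block index 1. A prompt-only resubmission computes the same hash as its last full block, whereas the stored sequence's own last full block (index 2) has a different hash, which is the mismatch the original code hit.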
``a42542f`` made ``apply_mru_partial`` produce hits for the first time and immediately exposed a latent threading bug in the splice path, which had been dead code since the feature shipped.

``_extract_block_tensor_slice`` builds the partial's tensors via ``_clone_tensor`` (``mx.copy``), which is a *lazy* op. That op is created on the ``omlx-store-cache`` worker thread (where ``store_cache`` runs) and bound to that thread's MLX stream. ``apply_mru_partial`` splices the partial into a live cache on the separate ``mlx-global`` inference thread; generation's ``mx.async_eval`` then walks the compute graph back to the worker thread's stream, which the inference thread cannot see:

    RuntimeError: There is no Stream(gpu, 4) in current thread.

The source ``_extracted_cache`` is already materialized before the worker runs (the inference thread batches ``_collect_arrays_from_extracted_cache`` through ``mx.async_eval`` and the worker calls ``mx.synchronize()``), so the fix is just to finalize the freshly-sliced copies: ``_update_mru_partial`` now calls ``_materialize_mru_kv`` on the extracted partial before stashing it. Because the inputs are already resident this is a small memcpy of the tail KV (no recompute), and it collapses the lazy ``mx.copy`` into concrete, stream-free data safe to splice and evaluate from any thread.

``apply_mru_partial`` itself is unchanged: ``add_request`` (the splice) and ``step`` (generation) both run on the single ``mlx-global`` worker, so the splice result never crosses threads; only the stashed input did.

Tests
-----
New ``TestMRUPartialCrossThreadSafety``:

- ``materialize_mru_kv_handles_extract_shapes``: the helper evaluates array leaves across the plain ``(keys, values)`` and TurboQuant ``(tag, (k, v))`` shapes and tolerates the non-array tag and an empty list.
- ``stashed_partial_splices_across_threads``: extract+stash on a worker thread, splice+evaluate on the main thread. Verified to fail without the fix (``no Stream(gpu, N) in current thread`` at the splice eval) and pass with it. The test pre-materializes ``cache_data`` to mirror production, where the inference thread always hands the worker an already-evaluated extracted cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
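The shape handling that ``materialize_mru_kv_handles_extract_shapes`` exercises can be sketched as a recursive leaf collector. The code below is an assumption-laden illustration: ``collect_array_leaves`` and ``FakeArray`` are hypothetical, and ``FakeArray`` stands in for ``mx.array`` so the sketch runs without MLX. In the real helper the collected leaves would presumably be handed to an MLX evaluation call on the worker thread before stashing.

```python
from typing import Any, Callable, List


class FakeArray:
    """Stand-in for an MLX array so this sketch runs without MLX."""


def collect_array_leaves(
    obj: Any, is_array: Callable[[Any], bool]
) -> List[Any]:
    """Walk nested lists/tuples and return every array leaf, in order."""
    if is_array(obj):
        return [obj]
    if isinstance(obj, (list, tuple)):
        leaves: List[Any] = []
        for item in obj:
            leaves.extend(collect_array_leaves(item, is_array))
        return leaves
    # Non-array metadata such as a string tag is skipped, not an error.
    return []


def is_fake_array(obj: Any) -> bool:
    return isinstance(obj, FakeArray)
```

A plain per-layer entry ``(keys, values)`` and a tagged entry ``(tag, (k, v))`` both reduce to the same flat list of two arrays, and an empty list reduces to an empty list, so a single evaluation call can cover every layout.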
The header gauge in RUNTIME CACHE OBSERVABILITY sat alongside the Memory and SSD gauges, but it did not belong there. Memory and SSD each measure one exhaustible budget shared across every loaded model, so an aggregate fill bar is meaningful. MRU tail slots are allocated per-model (``--mru-partial-max-entries`` applies to each model's own cache), so summing entries and max-entries across models produces a number that corresponds to no real resource. Per-model occupancy is already shown in the "MRU Tails" column of the per-model table, which is the correct granularity.

- Remove the header gauge block from ``_status.html``.
- Remove the now-dead ``runtimeMruPartialPercent`` getter from ``dashboard.js``.
- Drop the payload-level ``mru_partial_entries`` aggregate in ``_build_runtime_cache_observability``. ``mru_partial_max_entries`` is kept as a sum solely as the ``mruEnabled`` feature-on gate (drives the rate strip and the per-model column); it is no longer surfaced as a gauge value.

Per-model ``mru_partial_entries`` / ``mru_partial_max_entries`` on each ``models[]`` entry are unchanged, as are the global "MRU Tail Hit Rate" and "MRU Tokens Saved" rate-strip cells (those are rates/counters, legitimately global).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Thanks for the careful work; the transactional splice and hybrid carve-out show you thought through the failure modes. My take is that avoiding at most ~2K tokens of partial-tail re-prefill (the block_size upper bound) feels like a narrow win for the amount of new code this brings. Default 4 slots, hybrid models excluded, plus the permanent surface of transactional rollback, deferred-clear interaction, and hot-cache memory cotenancy is a lot for an optimization that the prefix cache already covers at block boundaries. This is just my read though; I'm curious how you see the tradeoff. If partial-tail recompute is hurting a specific workload you're targeting, that context would help me look at this differently.
@jundot Prefill is the main performance weakness on Apple silicon: if there is any edge to be shaved off prefill, no matter how marginal, I attack it on principle. 😀 My goal was to solidify the approach, measure the performance difference with and without it, then make a judgement on whether it is worth it. I know it's a lot of code that might not justify the effort, but in the worst case others will know not to try. The code approach has solidified. I just need to sit down and do the profiling; I've been distracted the last few days. I'll mark it as ready for review when I have evidence.
This PR proposes an in-memory cache for the trailing slice of prefill that cannot be committed to the disk cache. It captures the sub-block tail of a completed prefill (anything past the last full ``block_size`` boundary) so that a resubmitted prompt skips the avoidable partial recompute that block-aligned caching otherwise forces.

The cache is a bounded LRU dict keyed by ``parent_hash``, with default capacity 4 entries (configurable via ``--mru-partial-max-entries``). Entries are written at the end of ``store_cache`` for any prefill that produces a trailing partial, and consumed during cache-hit admission via ``apply_mru_partial``. On a successful splice the entry is promoted to the LRU tail; on capacity overflow the oldest entry is evicted. On any apply-time mismatch (different tail tokens, layer-count mismatch, splice failure), only the mismatched entry is evicted. Sibling entries for other prefixes are preserved.

This is a memory-only structure, kept distinct from the disk and hot caches. No attempt is made to persist partials; the eviction discipline below ensures they cannot accumulate beyond capacity.
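The LRU mechanics described above (promotion on hit, oldest-first eviction on overflow, mismatch evicting only the offending entry) can be sketched with ``collections.OrderedDict``. This is a simplified model, not the PR's implementation: ``MRUPartialSketch`` is a hypothetical class that stores only tail tokens, whereas the real entries also hold per-layer KV tensors and are spliced transactionally.

```python
from collections import OrderedDict
from typing import List, Optional, Sequence, Tuple


class MRUPartialSketch:
    def __init__(self, max_entries: int = 4) -> None:
        self.max_entries = max_entries
        self._entries: "OrderedDict[Optional[str], List[int]]" = OrderedDict()

    def stash(self, parent_hash: Optional[str], tail: Sequence[int]) -> None:
        if self.max_entries <= 0 or not tail:
            return  # feature disabled, or no trailing partial to keep
        self._entries.pop(parent_hash, None)  # same-key replacement
        self._entries[parent_hash] = list(tail)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry

    def apply(
        self, parent_hash: Optional[str], remaining: Sequence[int]
    ) -> Tuple[List[int], int]:
        """Return (tokens still to prefill, tokens saved)."""
        tail = self._entries.get(parent_hash)
        if tail is None:
            return list(remaining), 0
        if list(remaining[:len(tail)]) != tail:
            # Mismatch: drop only this entry; siblings stay warm.
            del self._entries[parent_hash]
            return list(remaining), 0
        self._entries.move_to_end(parent_hash)  # promote to the LRU tail
        return list(remaining[len(tail):]), len(tail)
```

A miss or mismatch leaves ``remaining`` untouched, so the caller falls back to the ordinary prefill path, which matches the "no benefit, not regression" worst case.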
Scope
The optimization helps across the full range of single-prompt repeat and concurrent-prompt scenarios:

- Up to ``mru_partial_max_entries`` distinct active prefixes get the same benefit per prefix. LRU eviction handles workloads beyond capacity gracefully: the most recently used prefixes stay warm, older ones fall back to a full recompute on next hit (which is the pre-cache baseline, so the worst case is "no benefit," not "regression").

The eviction gate on every mismatch protects correctness regardless of multiplicity; the operational footprint is bounded (see below).
Safety
The cache surface is small but the failure modes are sharp. The implementation:
- Any layer set containing a non-sliceable cache type (``RotatingKVCache``, ``ArraysCache``, ``BatchRotatingKVCache``, and so on) disables stashing entirely. Splicing into only the sliceable layers would create per-layer offset skew at decode and silently corrupt generation on Gemma 3, Mistral, and any future hybrid. Gated via an explicit ``KNOWN_SLICEABLE_CACHE_TYPES`` whitelist in ``omlx/cache/type_registry.py``. Notably, the registry's ``supports_block_slicing`` flag is not trustworthy here: ``DefaultCacheHandler`` falls back to KVCache semantics and would report several unregistered batch/pool types as sliceable.
- ``BlockAwarePrefixCache.clear()`` wipes the whole dict (cache-corruption recovery path).
- […] ``None``-keyed lookup, which would falsely match a short-prompt entry against an unrelated request whose parent is gone.
- […] ``PagedSSDCacheManager`` instance presence, not on whether SSD writes are happening. Under ``hot_cache_only=True`` (set via settings or the ``OMLX_HOT_CACHE_ONLY`` env), the disk writer thread is disabled but reconstruct still works via the hot tier's short-circuit; the MRU cache correctly remains active in that mode.

Test coverage in ``tests/test_prefix_cache.py::TestMRUPartialBlockCache`` and ``::TestMRUPartialMultiSlot`` pins each of these, including a transactional-rollback test that mocks ``mx.concatenate`` to fail mid-loop and verifies no layer was mutated, a structural invariant test that pins ``kv_data`` to ``mx.array`` storage so future "optimize to CPU copy" regressions get caught, and parameterized LRU mechanics covering capacity eviction, apply-success promotion, and sibling preservation under mismatches.

Operational note: deferred Metal cache clear
The post-completion deferred clear (#435, #557) is suppressed for one extra ``_DEFERRED_CLEAR_DELAY`` window when any MRU entry is warm at the deadline. Warm partials are a strong predictor that the same prompt will return immediately and would benefit from the still-resident lazy KV tensors. The suppression is bounded at one suppression per deferral epoch (a fresh budget arms only on transition from ``_deferred_clear_at is None``); the next deadline fires regardless of MRU state. This avoids the failure mode where hot-prompt repeats refresh the budget faster than ``_DEFERRED_CLEAR_DELAY`` and the pool-bloat mitigation (#411) is silently defeated.

Memory accounting
Each entry holds real ``mx.array`` allocations (via ``mx.copy``) and counts automatically against ``mx.get_active_memory()``. Every runtime memory enforcement and telemetry path in this codebase (process enforcer, scheduler limit checks, periodic-clear threshold) reads from there. Worst case is one ``block_size`` of KV per entry, held alive between completions and admissions.

Under ``hot_cache_only=True``, the hot cache and the MRU dict share the same in-memory KV headroom envelope. Operators running that mode at high ``--mru-partial-max-entries`` should size ``--hot-cache-max-size`` accordingly: the two settings are co-tenants of the same budget, not independent dials. Default 4 is conservative enough that this rarely matters in practice.

Configuration
``--mru-partial-max-entries N`` (default ``4``, matching the dflash ``max_entries`` precedent in #1120) sets the maximum simultaneous entries. ``0`` disables the feature entirely (silent fallback to "no MRU" behavior, mirroring the ``--hot-cache-max-size 0`` convention). Also available via ``mru_partial_max_entries`` in ``CacheSettings``. Operators with high-concurrency workloads can opt up; operators with memory pressure can opt down or off.

Admin endpoint symmetry
The ``/api/ssd-cache/clear`` admin endpoint now also wipes MRU partials for each loaded scheduler, via a new ``BlockAwarePrefixCache.clear_mru_partials()`` method. Without this, partials would chain from paged-block hashes whose KV bytes were just flushed by the endpoint, violating the operator's "drop all warm caches" intent. The same ``clear_mru_partials()`` hook is the seam intended for the future ``/api/hot-cache/clear`` endpoint introduced in #1183: a one-line addition at the same loop once that PR lands.

Observability
The MRU partial cache plugs into the same observability surface #1183 established for the prefix and hot/disk tiers, so operators tune ``--mru-partial-max-entries`` from the same dashboard they use for memory and disk hit rates.

New counters on ``PrefixCacheStats``:

- ``mru_partial_stashes`` (entry writes, including same-key replacements)
- ``mru_partial_hits`` (successful splices)
- ``mru_partial_evictions`` (capacity-overflow + apply-miss + admin-clear wipes; the cache-corruption ``clear()`` path intentionally zeros all counters via ``reset_stats()`` rather than incrementing here)
- ``mru_partial_tokens_saved`` (the direct compute-saved measure: prefill tokens that did not have to be re-run)

New gauges: ``mru_partial_entries`` and ``mru_partial_max_entries``. Derived ``mru_partial_hit_rate = hits / stashes`` (the "stash payoff" ratio) surfaces both windowed and cumulative in ``CacheRateTracker``.

The dashboard mirrors the hot-cache pattern: a header gauge ("MRU tails N/M entries"), rate-strip cells for the hit rate and tokens-saved counter, and a per-model column. All gated on ``mru_partial_max_entries > 0`` so configurations with the feature disabled don't see the surface.

Base
This branch is based on #1183 to use the cache-tier observability and hot-cache architecture as the integration substrate. Targeting merge after #1183.